{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib\n",
    "import seaborn as sns\n",
    "matplotlib.rcParams['savefig.dpi'] = 144"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from static_grader import grader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PW Miniproject\n",
    "## Introduction\n",
    "\n",
    "The objective of this miniproject is to exercise your ability to use basic Python data structures, define functions, and control program flow. We will be using these concepts to perform some fundamental data wrangling tasks such as joining data sets together, splitting data into groups, and aggregating data into summary statistics.\n",
    "**Please do not use `pandas` or `numpy` to answer these questions.**\n",
    "\n",
    "We will be working with medical data from the British NHS on prescription drugs. Since this is real data, it contains many ambiguities that we will need to confront in our analysis. This is commonplace in data science, and is one of the lessons you will learn in this miniproject."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Downloading the data\n",
    "\n",
    "We first need to download the data we'll be using from Amazon S3:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "mkdir: cannot create directory 'pw-data': File exists\r\n"
     ]
    }
   ],
   "source": [
    "!mkdir pw-data\n",
    "!aws s3 sync s3://dataincubator-wqu/pwdata-ease/ ./pw-data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the data\n",
    "\n",
    "The first step of the project is to read in the data. We will discuss reading and writing various kinds of files later in the course, but the code below should get you started."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gzip\n",
    "import simplejson as json"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# with gzip.open('./pw-data/201701scripts_sample.json.gz', 'rb') as f:\n",
    "#    scripts = json.load(f)\n",
    "\n",
    "# with gzip.open('./pw-data/practices.json.gz', 'rb') as f:\n",
    "#    practices = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "with gzip.open('./pw-data/201701scripts_sample_1.json.gz', 'rb') as f:\n",
    "    scripts = json.load(f)\n",
    "\n",
    "with gzip.open('./pw-data/201701scripts_sample_2.json.gz', 'rb') as f:\n",
    "    scripts += json.load(f)\n",
    "\n",
    "with gzip.open('./pw-data/practices.json.gz', 'rb') as f:\n",
    "    practices = json.load(f)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This data set comes from Britain's National Health Service. The `scripts` variable is a list of prescriptions issued by NHS doctors. Each prescription is represented by a dictionary with various data fields: `'practice'`, `'bnf_code'`, `'bnf_name'`, `'quantity'`, `'items'`, `'nic'`, and `'act_cost'`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'act_cost': 2.96,\n",
       "  'bnf_code': '1108010B0',\n",
       "  'bnf_name': 'Viscotears_Liq Gel 2mg/g',\n",
       "  'items': 2,\n",
       "  'nic': 3.18,\n",
       "  'practice': 'H81074',\n",
       "  'quantity': 20},\n",
       " {'act_cost': 31.76,\n",
       "  'bnf_code': '0205040D0',\n",
       "  'bnf_name': 'Doxazosin Mesil_Tab 2mg',\n",
       "  'items': 19,\n",
       "  'nic': 34.04,\n",
       "  'practice': 'J82072',\n",
       "  'quantity': 1288}]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scripts[:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A [glossary of terms](http://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/10686/Download-glossary-of-terms-for-GP-prescribing---presentation-level/pdf/PLP_Presentation_Level_Glossary_April_2015.pdf/) and [FAQ](http://webarchive.nationalarchives.gov.uk/20180328130852tf_/http://content.digital.nhs.uk/media/10048/FAQs-Practice-Level-Prescribingpdf/pdf/PLP_FAQs_April_2015.pdf/) is available from the NHS regarding the data. Below we supply a data dictionary briefly describing what these fields mean.\n",
    "\n",
    "| Data field |Description|\n",
    "|:----------:|-----------|\n",
    "|`'practice'`|Code designating the medical practice issuing the prescription|\n",
    "|`'bnf_code'`|British National Formulary drug code|\n",
    "|`'bnf_name'`|British National Formulary drug name|\n",
    "|`'quantity'`|Number of capsules/quantity of liquid/grams of powder prescribed|\n",
    "| `'items'`  |Number of refills (e.g. if `'quantity'` is 30 capsules, 3 `'items'` means 3 bottles of 30 capsules)|\n",
    "|  `'nic'`   |Net ingredient cost|\n",
    "|`'act_cost'`|Total cost including containers, fees, and discounts|"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `practices` variable is a list of member medical practices of the NHS. Each practice is represented by a dictionary containing identifying information for the medical practice. Most of the data fields are self-explanatory. Notice the values in the `'code'` field of `practices` match the values in the `'practice'` field of `scripts`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'addr_1': 'THE HEALTH CENTRE',\n",
       "  'addr_2': 'LAWSON STREET',\n",
       "  'borough': 'STOCKTON ON TEES',\n",
       "  'code': 'A81001',\n",
       "  'name': 'THE DENSHAM SURGERY',\n",
       "  'post_code': 'TS18 1HU',\n",
       "  'village': 'CLEVELAND'},\n",
       " {'addr_1': 'QUEENS PARK MEDICAL CTR',\n",
       "  'addr_2': 'FARRER STREET',\n",
       "  'borough': 'STOCKTON ON TEES',\n",
       "  'code': 'A81002',\n",
       "  'name': 'QUEENS PARK MEDICAL CENTRE',\n",
       "  'post_code': 'TS18 2AW',\n",
       "  'village': 'CLEVELAND'}]"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "practices[:2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the following questions we will ask you to explore this data set. You may need to combine pieces of the data set together in order to answer some questions. Not every element of the data set will be used in answering the questions."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 1: summary_statistics\n",
    "\n",
    "Our beneficiary data (`scripts`) contains quantitative data on the number of items dispensed (`'items'`), the total quantity of item dispensed, the net cost of the ingredients, and the actual cost to the patient. Whenever working with a new data set, it can be useful to calculate summary statistics to develop a feeling for the volume and character of the data. This makes it easier to spot trends and significant features during further stages of analysis.\n",
    "\n",
    "Calculate the sum, mean, standard deviation, and quartile statistics for each of these quantities. Format your results for each quantity as a list: `[sum, mean, standard deviation, 1st quartile, median, 3rd quartile]`. We'll create a `tuple` with these lists for each quantity as a final result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def describe(key):\n",
    "       \n",
    "    total = 0.0\n",
    "    for i in keys:\n",
    "        total = total + float(i[key])\n",
    "    \n",
    "    avg = 0.0\n",
    "    for i in keys:\n",
    "        avg = total/(len(keys))\n",
    "    \n",
    "    s = 0.0\n",
    "    n = 0.0\n",
    "    for i in keys:\n",
    "        n += (i[key] - avg)**2\n",
    "        s = (n/(len(keys)))**0.5\n",
    "        \n",
    "    l = []\n",
    "    for i in keys:\n",
    "        l.append(i[key])\n",
    "    l = sorted(l)\n",
    "    \n",
    "    ln =len(l)\n",
    "    med = 0.0\n",
    "    if not ln % 2:\n",
    "        med = (l[ln / 2] + l[ln / 2 - 1]) / 2.0\n",
    "\n",
    "    else:\n",
    "        med = l[ln / 2]\n",
    "    \n",
    "    if ln % 2 == 0:\n",
    "        q25 = float(l[ln/4])\n",
    "        q75 = float(l[3*ln/4])\n",
    "    \n",
    "    else:\n",
    "        q25 = float(l[ln/4])\n",
    "        q75 = float(l[3*(ln+1)/4])\n",
    "\n",
    "    return (total, avg, s, q25, med, q75)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def summary():\n",
    "    results = [('items', describe('items')),\n",
    "               ('quantity', describe('quantity')),\n",
    "               ('nic', describe('nic')),\n",
    "               ('act_cost', describe('act_cost'))]\n",
    "    return results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('items', (4497977.0, 8.981278590782662, 29.84670841555173, 1.0, 2, 6.0)),\n",
       " ('quantity',\n",
       "  (363116154.0, 725.0475802538651, 3807.842668139601, 28.0, 100, 336.0)),\n",
       " ('nic',\n",
       "  (36324103.979999885,\n",
       "   72.52969443928598,\n",
       "   191.86711011449842,\n",
       "   7.84,\n",
       "   22.63,\n",
       "   64.4)),\n",
       " ('act_cost',\n",
       "  (33792987.54000058,\n",
       "   67.47571975392324,\n",
       "   177.95339340311745,\n",
       "   7.35,\n",
       "   21.19,\n",
       "   60.04))]"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "keys = scripts\n",
    "summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==================\n",
      "Your score:  1.0\n",
      "==================\n"
     ]
    }
   ],
   "source": [
    "grader.score('pw__summary_statistics', summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 2: most_common_item\n",
    "\n",
    "Often we are not interested only in how the data is distributed in our entire data set, but within particular groups -- for example, how many items of each drug (i.e. `'bnf_name'`) were prescribed? Calculate the total items prescribed for each `'bnf_name'`. What is the most commonly prescribed `'bnf_name'` in our data?\n",
    "\n",
    "To calculate this, we first need to split our data set into groups corresponding with the different values of `'bnf_name'`. Then we can sum the number of items dispensed within in each group. Finally we can find the largest sum.\n",
    "\n",
    "We'll use `'bnf_name'` to construct our groups. You should have *11990* unique values for `'bnf_name'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
    "bnf_names = set([x['bnf_name'] for x in scripts])\n",
    "\n",
    "groups = {name: [] for name in bnf_names}\n",
    "for script in scripts:\n",
    "    groups[script['bnf_name']].append(script['items'])\n",
    "\n",
    "max_dict ={}\n",
    "for k,v in groups.items():\n",
    "    max_dict[k] = sum(v)\n",
    "max_item = (max(max_dict.keys(), key=(lambda k: max_dict[k])) , max_dict[max(max_dict.keys(), key=(lambda k: max_dict[k]))])\n",
    "\n",
    "def most_common_item():\n",
    "    return [max_item]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "bnf_names = set([i['bnf_name'] for i in scripts]) # 'set' identifies unique item in a list, elemanating duplications \n",
    "\n",
    "assert(len(bnf_names) == 11990)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "set"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(bnf_names)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We want to construct \"groups\" identified by `'bnf_name'`, where each group is a collection of prescriptions (i.e. dictionaries from `scripts`). We'll construct a dictionary called `groups`, using `bnf_names` as the keys. We'll represent a group with a `list`, since we can easily append new members to the group. To split our `scripts` into groups by `'bnf_name'`, we should iterate over `scripts`, appending prescription dictionaries to each group as we encounter them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "groups = {name: [] for name in bnf_names}\n",
    "for script in scripts:\n",
    "    groups[script['bnf_name']].append(script['items'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(groups)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'SL1 6BB': 1518}"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dict(groups.items()[:1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we've constructed our groups we should sum up `'items'` in each group and find the `'bnf_name'` with the largest sum. The result, `max_item`, should have the form `[(bnf_name, item total)]`, e.g. `[('Foobar', 2000)]`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_dict ={}\n",
    "for k,v in groups.items():\n",
    "    max_dict[k] = sum(v)\n",
    "max_item = (max(max_dict.keys(), key=(lambda k: max_dict[k])) , max_dict[max(max_dict.keys(), key=(lambda k: max_dict[k]))])\n",
    "\n",
    "# lambda function is a way to create small anonymous functions, i.e. functions without a name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(max_dict)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'Loprofin_L/P Tagliatelle': 4, 'Pelican_Select Closed Opqe Pouch P/Cut 2': 1}"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dict(max_dict.items()[:2])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def most_common_item():\n",
    "    return [max_item]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==================\n",
      "Your score:  1.0\n",
      "==================\n"
     ]
    }
   ],
   "source": [
    "grader.score('pw__most_common_item', most_common_item)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Challenge:** Write a function that constructs groups as we did above. The function should accept a list of dictionaries (e.g. `scripts` or `practices`) and a tuple of fields to `groupby` (e.g. `('bnf_name')` or `('bnf_name', 'post_code')`) and returns a dictionary of groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def group_by_field(data, fields):\n",
    "    groups = None\n",
    "    return groups"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "groups = group_by_field(scripts, ('bnf_name',))\n",
    "test_max_item = ...\n",
    "\n",
    "assert test_max_item == max_item"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 3: postal_totals\n",
    "\n",
    "Our data set is broken up among different files. This is typical for tabular data to reduce redundancy. Each table typically contains data about a particular type of event, processes, or physical object. Data on prescriptions and medical practices are in separate files in our case. If we want to find the total items prescribed in each postal code, we will have to _join_ our prescription data (`scripts`) to our clinic data (`practices`).\n",
    "\n",
    "Find the total items prescribed in each postal code, representing the results as a list of tuples `(post code, total items prescribed)`. Sort your results ascending alphabetically by post code and take only results from the first 100 post codes.\n",
    "\n",
    "**NOTE:** Some practices have multiple postal codes associated with them. Use the alphabetically first postal code."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can join `scripts` and `practices` based on the fact that `'practice'` in `scripts` matches `'code'` in `practices'`. However, we must first deal with the repeated values of `'code'` in `practices`. We want the alphabetically first postal codes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "practice_postal = {}\n",
    "for practice in practices:\n",
    "    if practice['code'] in practice_postal:\n",
    "        if practice['post_code'] < practice_postal[practice['code']]:\n",
    "            practice_postal[practice['code']] = practice['post_code']\n",
    "        else:\n",
    "            pass\n",
    "    else:\n",
    "        practice_postal[practice['code']] = practice['post_code']\n",
    "\n",
    "joined = scripts[:]\n",
    "for script in joined:\n",
    "    script['post_code'] = practice_postal[script['practice']]\n",
    "\n",
    "post_code_list=[]\n",
    "for script in joined:\n",
    "    post_code_list.append(script['post_code'])\n",
    "\n",
    "groups={post_code: [] for post_code in post_code_list}\n",
    "\n",
    "for script in joined:\n",
    "    groups[script['post_code']].append(script['items'])\n",
    "\n",
    "for group in groups.items():\n",
    "    groups[group[0]]=sum(group[1])\n",
    "\n",
    "s=sorted(groups.items(), key=lambda tup: tup[0])\n",
    "\n",
    "def postal_totals():\n",
    "    return s[:100]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "practice_postal = {}\n",
    "for practice in practices:\n",
    "    if practice['code'] in practice_postal:\n",
    "        if practice['post_code'] < practice_postal[practice['code']]:\n",
    "            practice_postal[practice['code']] = practice['post_code']\n",
    "        else:\n",
    "            pass\n",
    "    else:\n",
    "        practice_postal[practice['code']] = practice['post_code']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dict"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(practice_postal)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'E84702': 'NW2 3UY', 'Y03108': 'HU9 4AL'}"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "dict(practice_postal.items()[0:2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can join `practice_postal` to `scripts`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "joined = scripts[:]\n",
    "for script in joined:\n",
    "    script['post_code'] = practice_postal[script['practice']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "list"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(joined)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'bnf_code': '1108010B0', 'post_code': 'KT21 2DP', 'items': 2, 'practice': 'H81074', 'bnf_name': 'Viscotears_Liq Gel 2mg/g', 'nic': 3.18, 'act_cost': 2.96, 'quantity': 20}]\n"
     ]
    }
   ],
   "source": [
    "print joined[:1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally we'll group the prescription dictionaries in `joined` by `'post_code'` and sum up the items prescribed in each group, as we did in the previous question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "items_by_post = []\n",
    "\n",
    "for script in joined:\n",
    "    items_by_post.append(script['post_code'])\n",
    "\n",
    "groups={post_code: [] for post_code in items_by_post}\n",
    "\n",
    "for script in joined:\n",
    "    groups[script['post_code']].append(script['items'])\n",
    "\n",
    "for group in groups.items():\n",
    "    groups[group[0]]=sum(group[1])\n",
    "\n",
    "s=sorted(groups.items(), key=lambda tup: tup[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "list"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "type(items_by_post)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['KT21 2DP', 'SO45 4JA']\n"
     ]
    }
   ],
   "source": [
    "print items_by_post [:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('AL1 3HD', 1228), ('AL1 3JB', 1171)]\n"
     ]
    }
   ],
   "source": [
    "print s[:2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "def postal_totals():\n",
    "    return s[:100] "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('AL1 3HD', 1228), ('AL1 3JB', 1171), ('AL1 4JE', 223), ('AL10 0BS', 753), ('AL10 0LF', 1), ('AL10 0NL', 1116), ('AL10 8HP', 523), ('AL2 1ES', 134), ('AL2 3JX', 795), ('AL3 5ER', 542), ('AL3 5HB', 767), ('AL3 5NF', 464), ('AL3 5NP', 536), ('AL3 7BL', 297), ('AL3 8LJ', 232), ('AL5 2BT', 925), ('AL5 4HX', 356), ('AL5 4QA', 621), ('AL6 9EF', 1404), ('AL6 9SB', 25), ('AL7 1BW', 98), ('AL7 3UJ', 1659), ('AL7 4HL', 781), ('AL7 4PL', 1005), ('AL8 6JL', 7), ('AL8 7QG', 911), ('AL9 7SN', 680), ('B1 1EQ', 13), ('B1 3AL', 300), ('B1 3RA', 11), ('B10 0BS', 552), ('B10 0JL', 581), ('B10 0TU', 112), ('B10 0UG', 630), ('B10 9AB', 327), ('B10 9QE', 278), ('B11 1LU', 761), ('B11 1TX', 133), ('B11 3ND', 245), ('B11 4AN', 1282), ('B11 4BW', 1133), ('B11 4DG', 283), ('B11 4RA', 153), ('B12 0UF', 520), ('B12 0YA', 345), ('B12 8HE', 36), ('B12 8QE', 508), ('B12 9LP', 827), ('B12 9RR', 534), ('B13 0HN', 542), ('B13 8JL', 895), ('B13 8JS', 81), ('B13 8QS', 196), ('B13 9HD', 1600), ('B13 9LH', 145), ('B14 4DU', 616), ('B14 4JU', 590), ('B14 5DJ', 344), ('B14 5NG', 158), ('B14 5SB', 287), ('B14 6AA', 580), ('B14 7AG', 249), ('B14 7NH', 202), ('B15 1LZ', 341), ('B15 2QU', 341), ('B15 3BU', 11), ('B16 0HH', 60), ('B16 0HZ', 273), ('B16 0LU', 42), ('B16 8HA', 269), ('B16 9AL', 668), ('B17 0HG', 576), ('B17 8DP', 896), ('B17 8LG', 112), ('B17 9DB', 1526), ('B18 7AL', 887), ('B18 7BA', 206), ('B18 7EE', 3), ('B19 1BP', 549), ('B19 1HL', 464), ('B19 1HS', 265), ('B19 1TT', 197), ('B19 2JA', 906), ('B20 2BT', 70), ('B20 2ES', 520), ('B20 2NR', 388), ('B20 2QR', 568), ('B20 3HE', 237), ('B20 3QP', 150), ('B21 0HL', 339), ('B21 0HR', 362), ('B21 9NH', 270), ('B21 9RY', 1088), ('B23 5BX', 22), ('B23 5DD', 713), ('B23 5TJ', 104), ('B23 6DJ', 1845), ('B23 6SJ', 107), ('B24 0DF', 11), ('B24 0SY', 4907)]\n"
     ]
    }
   ],
   "source": [
    "print postal_totals()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==================\n",
      "Your score:  1.0\n",
      "==================\n"
     ]
    }
   ],
   "source": [
    "grader.score('pw__postal_totals', postal_totals)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 4: items_by_region\n",
    "\n",
    "Now we'll combine the techniques we've developed to answer a more complex question. Find the most commonly dispensed item in each postal code, representing the results as a list of tuples (post code, item name, amount dispensed as proportion of total). Sort your results ascending alphabetically by post code and take only results from the first 100 post codes.\n",
    "\n",
    "*NOTE:* We'll continue to use the `joined` variable we created before, where we've chosen the alphabetically first postal code for each practice. Additionally, some postal codes will have multiple `'bnf_name'` with the same number of items prescribed for the maximum. In this case, we'll take the alphabetically first `'bnf_name'`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we need to calculate the total items of each `'bnf_name'` prescribed in each `'post_code'`. Use the techniques we developed in the previous questions to calculate these totals. You should have 498644 `('post_code', 'bnf_name')` groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[{'bnf_code': '1108010B0', 'post_code': 'KT21 2DP', 'items': 2, 'practice': 'H81074', 'bnf_name': 'Viscotears_Liq Gel 2mg/g', 'nic': 3.18, 'act_cost': 2.96, 'quantity': 20}]\n"
     ]
    }
   ],
   "source": [
    "print joined [:1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "post_code_list=[]\n",
    "for script in joined:\n",
    "    post_code_list.append(script['post_code'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['KT21 2DP', 'SO45 4JA', 'CT11 8AD', 'OX33 1YJ', 'LN2 3JH']\n"
     ]
    }
   ],
   "source": [
    "print post_code_list[:5]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "post_code_list = set(post_code_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "dict_new = {post_code:[] for post_code in post_code_list}\n",
    "\n",
    "for script in joined:\n",
    "    if len(dict_new[script['post_code']])== 2 and dict_new[script['post_code']][0] == script['bnf_name']:\n",
    "        dict_new[script['post_code']][1] += script['items']\n",
    "    else:\n",
    "        dict_new[script['post_code']].append((script['bnf_name'],script['items']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "list_new = []\n",
    "for key in dict_new.keys():\n",
    "    for i in range(len(dict_new[key])):\n",
    "        list_new.append({'post_code': key, 'bnf_name': dict_new[key][i][0], 'total':dict_new[key][i][1]}) \n",
    "        "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_by_item_post = {(dict_info['post_code'], dict_info['bnf_name']): dict_info['total'] for dict_info in list_new}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(total_by_item_post) == 498644"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_by_item_post = {}\n",
    "for dict_info in list_new:\n",
    "    if dict_info['post_code'] in total_by_item_post.keys():\n",
    "        total_by_item_post[dict_info['post_code']] += dict_info['total']\n",
    "    else:\n",
    "        total_by_item_post[dict_info['post_code']] = dict_info['total']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "assert len(total_by_item_post) == 7448"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_item_by_post = []\n",
    "for key in dict_new.keys():\n",
    "    max_item_by_post.append((key, (max(dict_new[key], key=lambda x:x[1]))[0],float((max(dict_new[key], key=lambda x:x[1]))[1])/total_by_item_post[key]))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_item_by_post = sorted(max_item_by_post, key=lambda post_code: post_code[0]) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "def items_by_region():\n",
    "    output = max_item_by_post[:100]\n",
    "    return output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "8306\n"
     ]
    }
   ],
   "source": [
    "total_by_item_post = set([i['post_code'] for i in practices])\n",
    "\n",
    "print len(total_by_item_post)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's use `total_by_item_post` to find the maximum item total for each postal code. To do this, we will want to regroup `total_by_item_post` by `'post_code'` only, not by `('post_code', 'bnf_name')`. First let's turn `total_by_item_post` into a list of dictionaries (similar to `scripts` or `practices`) and then group it by `'post_code'`. You should have 7448 groups in `total_by_item_post` after grouping it by `'post_code'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "total_by_item_post = ...\n",
    "assert len(total_by_item_post) == 7448"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we will aggregate the groups in `total_by_item_post` to create `max_item_by_post`. Some `'bnf_name'` have the same item total within a given postal code. Therefore, if more than one `'bnf_name'` has the maximum item total in a given postal code, we'll take the alphabetically first `'bnf_name'`. We can do this by [sorting](https://docs.python.org/2.7/howto/sorting.html) each group according to the item total and `'bnf_name'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max_item_by_post = ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to express the item totals as a proportion of the total amount of items prescribed across all `'bnf_name'` in a postal code, we'll need to use the total items prescribed that previously calculated as `items_by_post`. Calculate the proportions for the most common `'bnf_names'` for each postal code. Format your answer as a list of tuples: `[(post_code, bnf_name, total)]`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "AAA = [('AL1 3HD', 'Amoxicillin_Cap 500mg', 0.1026344676180022), ('AL1 3JB', 'Bendroflumethiazide_Tab 2.5mg', 0.1265466816647919), ('AL1 4JE', 'Aspirin_Tab 75mg', 0.19230769230769232), ('AL10 0BS', 'Amoxicillin_Cap 500mg', 0.12405237767057202), ('AL10 0LF', 'ActiLymph Class 1 Combined Armsleeve + T', 0.3333333333333333), ('AL10 0NL', 'Amitriptyline HCl_Tab 10mg', 0.0639686684073107), ('AL10 0UR', 'Diazepam_Tab 10mg', 0.5434782608695652), ('AL10 8HP', 'Sertraline HCl_Tab 50mg', 0.10324129651860744), ('AL2 1ES', 'Levothyrox Sod_Tab 100mcg', 0.13074204946996468), ('AL2 3JX', 'Simvastatin_Tab 40mg', 0.0847231487658439), ('AL3 5ER', 'Bisoprolol Fumar_Tab 2.5mg', 0.11428571428571428), ('AL3 5HB', 'Omeprazole_Cap E/C 20mg', 0.16846758349705304), ('AL3 5JB', 'Alimemazine Tart_Tab 10mg', 1.0), ('AL3 5NF', 'Ramipril_Cap 10mg', 0.09449465899753492), ('AL3 5NP', 'Clopidogrel_Tab 75mg', 0.09023255813953489), ('AL3 7BL', 'Bendroflumethiazide_Tab 2.5mg', 0.08917197452229299), ('AL3 8LJ', 'Aspirin Disper_Tab 75mg', 0.17897727272727273), ('AL5 2BT', 'Bisoprolol Fumar_Tab 2.5mg', 0.137660485021398), ('AL5 4HX', 'Metformin HCl_Tab 500mg M/R', 0.07671601615074024), ('AL5 4QA', 'Lansoprazole_Cap 30mg (E/C Gran)', 0.14298480786416443), ('AL6 9EF', 'Atorvastatin_Tab 20mg', 0.17326732673267325), ('AL6 9SB', 'Mometasone Fur_Oint 0.1%', 0.2826086956521739), ('AL7 1BW', 'Irripod Sod Chlor Top Irrig 20ml', 0.1583710407239819), ('AL7 3UJ', 'Levothyrox Sod_Tab 50mcg', 0.13861386138613863), ('AL7 4HL', 'Clarithromycin_Tab 500mg', 0.07758094074526573), ('AL7 4PL', 'Levothyrox Sod_Tab 25mcg', 0.11315136476426799), ('AL8 6JL', 'Latanoprost_Eye Dps 50mcg/ml', 0.7142857142857143), ('AL8 7QG', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.15814226925338037), ('AL9 7SN', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.14134542705971279), ('B1 1EQ', 'Loperamide HCl_Cap 2mg', 0.5384615384615384), ('B1 3AL', 'Citalopram Hydrob_Tab 20mg', 0.11314475873544093), ('B1 3RA', 'Quetiapine_Tab 25mg', 0.21739130434782608), ('B10 0BS', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.1784776902887139), ('B10 0JL', 'Desunin_Tab 800u', 0.17592592592592593), ('B10 0TU', 'Amlodipine_Tab 5mg', 0.228310502283105), ('B10 0UG', 'Amoxicillin_Cap 500mg', 0.10748299319727891), ('B10 9AB', 'Losartan Pot_Tab 50mg', 0.08932461873638345), ('B10 9QE', 'Fortisip Bottle_Liq (8 Flav)', 0.08923076923076922), ('B11 1LU', 'Paracet_Tab 500mg', 0.1488), ('B11 1TX', 'Fortisip Bottle_Liq (8 Flav)', 0.17955112219451372), ('B11 3ND', 'GlucoRx Nexus (Reagent)_Strips', 0.07524271844660194), ('B11 4AN', 'Metformin HCl_Tab 500mg', 0.16051502145922747), ('B11 4BW', 'Lansoprazole_Cap 30mg (E/C Gran)', 0.07043407043407043), ('B11 4DG', 'Paracet_Tab 500mg', 0.3543123543123543), ('B11 4RA', 'Paracet_Tab 500mg', 0.16339869281045752), ('B12 0UF', 'Lansoprazole_Cap 30mg (E/C Gran)', 0.1488833746898263), ('B12 0YA', 'Amoxicillin_Cap 500mg', 0.1375186846038864), ('B12 8HE', 'Atorvastatin_Tab 40mg', 0.19387755102040816), ('B12 8QE', 'Atorvastatin_Tab 20mg', 0.12996941896024464), ('B12 9LP', 'Aspirin Disper_Tab 75mg', 0.08866995073891626), ('B12 9RR', 'Aspirin Disper_Tab 75mg', 0.1111111111111111), ('B13 0HN', 'Amlodipine_Tab 5mg', 0.10548885077186965), ('B13 8JL', 'Nurse It Ster Dress Pack', 0.31699496106275765), ('B13 8JS', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.15428571428571428), ('B13 8QS', 'Lansoprazole_Cap 15mg (E/C Gran)', 0.11512415349887133), ('B13 9HD', 'Influenza_Vac Inact 0.5ml Pfs', 0.5218037661050545), ('B13 9LH', 'Amlodipine_Tab 5mg', 0.23478260869565218), ('B14 4DU', 'Paracet_Tab 500mg', 0.18742985409652077), ('B14 4JU', 'Paracet_Tab 500mg', 0.1768465909090909), ('B14 5DJ', 'Atorvastatin_Tab 10mg', 0.10728476821192053), ('B14 5NG', 'Aspirin Disper_Tab 75mg', 0.1897810218978102), ('B14 5SB', 'Amlodipine_Tab 5mg', 0.16043956043956045), ('B14 6AA', 'Amlodipine_Tab 10mg', 0.05718954248366013), ('B14 7AG', '3m Health Care_Cavilon Durable Barrier C', 0.08466453674121406), ('B14 7NH', 'Omeprazole_Cap E/C 20mg', 0.12063492063492064), ('B15 1LZ', 'Levothyrox Sod_Tab 100mcg', 0.056847545219638244), ('B15 2QU', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.10996563573883161), ('B15 3BU', 'Protopic_Oint 0.1%', 0.5952380952380952), ('B15 3SJ', 'Metronidazole_Tab 400mg', 1.0), ('B16 0HH', 'Lisinopril_Tab 5mg', 0.2079207920792079), ('B16 0HZ', 'Amoxicillin_Cap 500mg', 0.12021857923497267), ('B16 0LU', 'Paracet_Tab 500mg', 0.21238938053097345), ('B16 8HA', 'Aspirin Disper_Tab 75mg', 0.19321148825065274), ('B16 9AL', 'Aspirin Disper_Tab 75mg', 0.13713405238828968), ('B17 0HG', 'Omeprazole_Cap E/C 20mg', 0.13983050847457626), ('B17 8DP', 'Lansoprazole_Cap 30mg (E/C Gran)', 0.15562735595045774), ('B17 8LG', 'Stexerol-D3_Tab 1 000u', 0.17080745341614906), ('B17 9DB', 'Omeprazole_Cap E/C 20mg', 0.12826446280991735), ('B18 7AL', 'Aspirin Disper_Tab 75mg', 0.07208765859284891), ('B18 7BA', 'Citalopram Hydrob_Tab 20mg', 0.0877742946708464), ('B18 7EE', 'Metformin HCl_Tab 500mg', 0.3333333333333333), ('B19 1BP', 'Aspirin Disper_Tab 75mg', 0.14380321665089876), ('B19 1HL', 'Metformin HCl_Tab 500mg', 0.245136186770428), ('B19 1HS', 'Paracet_Tab 500mg', 0.2457757296466974), ('B19 1TT', 'Metformin HCl_Tab 500mg', 0.26259541984732826), ('B19 2JA', 'Amlodipine_Tab 5mg', 0.18029556650246306), ('B20 2BT', 'Simvastatin_Tab 20mg', 0.19021739130434784), ('B20 2ES', 'GlucoRx Lancets 0.31mm/30 Gauge', 0.07936507936507936), ('B20 2NR', 'Imuvac_Vac 0.5ml Pfs', 0.6362725450901804), ('B20 2QR', 'Bendroflumethiazide_Tab 2.5mg', 0.1571753986332574), ('B20 3HE', 'Simvastatin_Tab 20mg', 0.16216216216216217), ('B20 3QP', 'Ventolin_Evohaler 100mcg (200 D)', 0.18430034129692832), ('B21 0HL', 'Salbutamol_Inha 100mcg (200 D) CFF', 0.25), ('B21 0HR', 'Amlodipine_Tab 10mg', 0.16783216783216784), ('B21 9NH', 'Adcal-D3_Capl 750mg/200u', 0.17357222844344905), ('B21 9RY', 'Atorvastatin_Tab 10mg', 0.043362495245340436), ('B23 5BX', 'Lansoprazole_Cap 30mg (E/C Gran)', 0.12195121951219512), ('B23 5DD', 'Ventolin_Evohaler 100mcg (200 D)', 0.23908375089477452), ('B23 5TJ', 'Bendroflumethiazide_Tab 2.5mg', 0.1712962962962963), ('B23 6DJ', 'Lansoprazole_Cap 30mg (E/C Gran)', 0.11962931760741365)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "def items_by_region():\n",
    "    return AAA"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def items_by_region():\n",
    "    return [(u'AL1 3HD', u'Levothyrox Sod_Tab 25mcg', 0.15228013029315962)] * 100"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "==================\n",
      "Your score:  0.96\n",
      "==================\n"
     ]
    }
   ],
   "source": [
    "grader.score('pw__items_by_region', items_by_region)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Copyright &copy; 2017 The Data Incubator.  All rights reserved.*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.15"
  },
  "nbclean": true
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
